[V1][Core] Add a cache hit threshold for requests#24520

Open
kfirwolfson wants to merge 2 commits into vllm-project:main from kfirwolfson:feature/kv-cache-hit-threshold

Conversation

@kfirwolfson kfirwolfson commented Sep 9, 2025

[V1][Core] Add a cache hit threshold for requests

Purpose

Introduce an optional KV-cache hit-rate gating mechanism, discussed in RFC #24256, to skip requests that are unlikely to benefit from prefill in P/D disaggregated deployments.

Edit: an additional scenario where this capability is useful is request preemption on a Decode instance in P/D disaggregated deployments. The scenario manifested in llm-d P/D tests: when a request is preempted on the Decode instance, vLLM today simply evicts the request's KV-cache blocks and later retries the request from scratch. This means the full prefill work is done internally inside the Decode instance, now also covering all of the output tokens generated so far (possibly many). Tests in the field showed that this leads to Decoders executing prefills and eventually locking up. The core problem is that the external router orchestrating P/D (such as llm-d, Dynamo, or Production Stack) has no control over this vLLM behavior once the Decode instance has received the request. Setting a small cache hit-rate threshold on the request (say 0.001) rejects this prefill work on preemption, and the request is sent back to the calling router / sidecar / worker.

What this PR adds

  • Global setting: --global-cache-hit-threshold ([0.0–1.0], default 0.0)
  • Per-request override: cache_hit_threshold ([0.0–1.0]) in incoming request ChatCompletionRequest / CompletionRequest (validated in the protocol layer).
  • Finish reason: New enum value and string "cache_threshold" exposed via v1 engine API. Requests rejected by this gating return HTTP 200 with finish_reason="cache_threshold" and no output tokens.
  • Config visibility & hashing: Threshold is included in VllmConfig and SchedulerConfig.
  • Bounds & validation: All threshold values validated to range [0.0, 1.0].
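To illustrate the new finish reason, a gated request returns an HTTP 200 with no output tokens, roughly like the following (response abbreviated and hypothetical; only the empty `text` and the `finish_reason` value are the points of interest):

```json
{
  "id": "cmpl-...",
  "object": "text_completion",
  "model": "Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": "",
      "logprobs": null,
      "finish_reason": "cache_threshold"
    }
  ]
}
```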

Why

  • Enables Decode-first optimization in P/D disaggregation: when computed-token ratio (local+external) over prompt length is below the threshold, we avoid scheduling low-benefit prefills on decode nodes. This reduces wasted work and remote KV transfers when cache reuse is insufficient.

Backwards compatibility

  • Default is 0.0 → feature is disabled by default. No behavior change unless the threshold is set globally or per request.

Test Plan

1) Unit Tests

Unit tests check the scheduler logic, including:

  • the per-request threshold overriding the global threshold
  • cache hits from the local KV cache, the external KV cache, or both

2) E2E manual tests

Run vllm serve with the --global-cache-hit-threshold 0.8 argument to set a default value. We'll override it in most requests.

vllm serve <model_path> --served-model "Llama-3.1-8B-Instruct" --global-cache-hit-threshold 0.8

Scheduler computes hit_ratio = computed_tokens / prompt_tokens
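A minimal sketch of this gating check (hypothetical names; the real scheduler code differs), including a guard against empty prompts so the division cannot fail:

```python
def should_reject(computed_tokens: int, prompt_tokens: int,
                  threshold: float) -> bool:
    """Reject a request whose cache hit ratio falls below the threshold."""
    if threshold <= 0.0 or prompt_tokens == 0:
        return False  # feature disabled, or nothing to prefill
    hit_ratio = computed_tokens / prompt_tokens
    return hit_ratio < threshold

assert should_reject(16, 58, 0.33)      # 16/58 ~ 0.28 < 0.33 -> rejected
assert not should_reject(16, 40, 0.33)  # 16/40 = 0.40 >= 0.33 -> scheduled
assert not should_reject(0, 100, 0.0)   # default threshold 0.0 disables gating
```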

We will send 4 requests. Note that the order matters: the first request fills the cache the others depend on.

  • Request 1 is sent with cache_hit_threshold: 0, so it is guaranteed to execute and populate the KV-cache. Its short prompt (≈26 tokens) is the prefix of the following requests.
  • Requests 2 and 3 are sent with cache_hit_threshold: 0.33:
    • Request 2: long prompt ≈ 58 tokens → ratio 16/58 ≈ 0.28 → rejected, as the ratio is below the threshold
    • Request 3: medium prompt ≈ 40 tokens → ratio 16/40 = 0.4 → normal generation
  • Request 4 is sent without a cache_hit_threshold field, so the global value of 0.8 takes effect: medium prompt ≈ 39 tokens → ratio 16/39 ≈ 0.41 → rejected, as the ratio is below the global threshold

Request 1) Warm the cache

This run uses cache_hit_threshold: 0 so it’s guaranteed to execute and populate the KV-cache for the base segment.

curl http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "This is the beginning of a long prompt with many tokens, we need a min of 16 to fill the default block size",
    "max_tokens": 20,
    "cache_hit_threshold": 0
  }'

Request 2) MISS case

curl http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "This is the beginning of a long prompt with many tokens, we need a min of 16 to fill the default block size. Then we continue with many words so that the token length will exceed 16*3 and cache hit rate will be too low to pass the test case threshold",
    "max_tokens": 20,
    "cache_hit_threshold": 0.33
  }'

Expected: HTTP 200 with "finish_reason": "cache_threshold"


Request 3) HIT case

curl http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "This is the beginning of a long prompt with many tokens, we need a min of 16 to be the shared prefix but continue with with whatever text tokens we like and keep it medium after all",
    "max_tokens": 20,
    "cache_hit_threshold": 0.33
  }'

Expected: normal generation ("finish_reason" is not "cache_threshold").

Request 4) MISS case using global threshold

Use global threshold set to 0.8

curl http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "This is the beginning of a long prompt with many tokens, we need a min of 16 to be the shared prefix and now continue with different text so the hit rate will be too low",
    "max_tokens": 20
  }'

Expected: HTTP 200 with "finish_reason": "cache_threshold"

Notes

  • Exact token counts can vary slightly by tokenizer/model; we got the numbers above using Llama-3.1-8B-Instruct

Test Result

E2E Local smoke tests on a single node:

  • Below threshold: responses returned 200 with finish_reason: "cache_threshold" and empty outputs.
    • Validated with debug logs
    • Request threshold:
      • Request cmpl-410004b615a54d73b7e9f0deebf2b852-0 rejected: cache hit rate 0.28 < threshold 0.33 (request)
    • Global threshold:
      • Request cmpl-6d66ba796f9247fcadca54ae428bf790-0 rejected: cache hit rate 0.41 < threshold 0.80 (global)
  • At/above threshold: normal token generation.
  • Validators rejected out-of-range values and accepted on boundaries 0.0 and 1.0 (not detailed above)

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a cache hit threshold to gate requests, which is a useful optimization for disaggregated deployments. The implementation is mostly solid, covering configuration, API exposure, and the core scheduling logic.

I've identified a critical issue that could lead to a ZeroDivisionError in the scheduler when processing requests with empty prompts. Additionally, there's a code duplication issue in the protocol validation that should be addressed to improve maintainability. My detailed comments provide suggestions for fixing these issues.

@github-actions

github-actions bot commented Sep 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which starts with a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 3425995 to 7c0485e Compare September 9, 2025 16:31
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 0b75346 to 8be6b61 Compare September 14, 2025 05:58
@robertgshaw2-redhat
Collaborator

@robertgshaw2-redhat self tag

@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 8be6b61 to 0400566 Compare September 30, 2025 10:24
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 4d756b7 to 0c15acc Compare September 30, 2025 12:59
@mergify
Contributor

mergify bot commented Oct 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kfirwolfson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 3, 2025
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 0c15acc to 0c9cb3f Compare October 6, 2025 06:06
@kfirwolfson
Author

kfirwolfson commented Dec 11, 2025

Updated after merge of #26813. Would appreciate help in reviewing @markmc, @njhill, @tlrmchlsmth

@markmc
Member

markmc commented Dec 19, 2025

I'm sorry we have been slow to give feedback on this. And all I have at this stage are some very high-level comments ...

For the decode preemption case - I think it's most natural to look at this first through the same lens as --kv-transfer-config '{"kv_load_failure_policy": "fail"} - a deployment-time choice that prefill should not happen in a decode instance

For the "decode-first P/D" case - this is introducing an alternative KV transfer protocol flow, an additional step before the current prefill-first flow kicks in, and might e.g. lead to additional kv_transfer_params in the prefill request

Both of these make sense, but I'm wary of the specifics of the proposal:

  • A single cache_hit_threshold concept might seem like an elegant solution to both, but to me it's a bit obtuse - it's not obvious (e.g. from the --help output) that this is something that (presumably?) will only interest KV transfer users
  • A threshold becomes yet another tunable knob, yet the way I've described the two features above doesn't immediately suggest that tuning is required
  • A per-request threshold is another leap in complexity and tunability, which feels premature to me

I think I'd be more immediately supportive of doing this in baby-steps, each of which could be a standalone PR:

  1. For the decode preemption case, add --kv-transfer-config '{"kv_decode_only": true}' or similar - I'd go so far as to deprecate the load failure policy config in favour of this one. I'm honestly a bit confused why we're not already using kv_role for this?
  2. For the decode-first mode, add --kv-transfer-config '{"enable_decode_first": true}'
  3. If there's a strong case to be made for having a tunable threshold, that can be added as KV transfer config
  4. Finally, if there's a strong case for per-request tuning of this threshold, that can build upon all of the above

@mergify
Contributor

mergify bot commented Dec 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kfirwolfson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 19, 2025
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from a897a78 to b3123e1 Compare December 21, 2025 08:34
@mergify mergify bot removed the needs-rebase label Dec 21, 2025
@kfirwolfson
Author

Thanks for the detailed discussion points, @markmc, answering inline.

For the decode preemption case - I think it's most natural to look at this first through the same lens as --kv-transfer-config '{"kv_load_failure_policy": "fail"} - a deployment-time choice that prefill should not happen in a decode instance

We think it’s a bit too strict to decide “no prefill at all” in the Decode instance. There are cases in which a little prefill on the Decode node makes sense, as exemplified below.

For the "decode-first P/D" case - this is introducing an alternative KV transfer protocol flow, an additional step before the current prefill-first flow kicks in, and might e.g. lead to additional kv_transfer_params in the prefill request

Not sure what you are referring to in “additional kv_transfer_params in the prefill request”. The “decode-first P/D” flow is an alternative to the current prefill-first flow. If there is enough cache, the Decoder will handle the request and there will be no remote prefill request (and no kv_transfer_params returned from the Prefiller). If the cache hit is low, then the flow continues in a similar way to the current prefill-flow: send the request to the Prefiller and then send it to the Decoder with the kv_transfer_params the Prefiller returned.
Note that in cases of a shared KV Cache storage there are no kv_transfer_params returned, as the Prefiller simply writes the cache directly to the shared storage and the Decoder reads them from the shared storage, as depicted in the diagram in the RFC #24256.

Both of these make sense, but I'm wary of the specifics of the proposal:
• A single cache_hit_threshold concept might seem like an elegant solution to both, but to me it's a bit obtuse - it's not obvious (e.g. from the --help output) that this is something that (presumably?) will only interest KV transfer users
• A threshold becomes yet another tunable knob, yet the way I've described the two features above doesn't immediately suggest that tuning is required

Not sure if you mean “KV Connector” or “KV Transfer”, as they are not exactly interchangeable. The feature can be useful for just KV-offloading using KV-Connectors as well (without direct transfers). One example is described under “Other Scenarios” in the RFC description, where cache is offloaded to local DRAM or disks but not transferred, and there is no P/D-D. A more practical use-case is a PD-D scenario with a shared KV Cache reachable by both Prefillers and Decoders, without direct KV Transfers between the two.
The help text talks about P/D optimizations, as the main use-case. Are you suggesting the “global-cache-hit-threshold” should be moved to the kv_transfer_params?

• A per-request threshold is another leap in complexity and tunability, which feels premature to me
I think I'd be more immediately supportive of doing this in baby-steps, each of which could be a standalone PR:

  1. For the decode preemption case, add --kv-transfer-config '{"kv_decode_only": true}' or similar - I'd go so far as to deprecate the load failure policy config in favour of this one. I'm honestly a bit confused why we're not already using kv_role for this?

As mentioned above and described in the RFC, some prefill work on the Decoder may be acceptable. This is also in “happy paths” and not just for Preemption – some connectors for example move data at the token-block or chunk of blocks resolution. In those cases, the cache will not contain the tail of the request (beyond block/chunk boundaries), which will be calculated using prefill on the Decoder.

  1. For the decode-first mode, add --kv-transfer-config '{"enable_decode_first": true}'

Decode-first is a router/sidecar flow, not something vLLM needs to be aware of. vLLM doesn't really have anything to do with this information, as it just handles incoming requests as they come. The global or per-request threshold allows the router/sidecar to control the flow (the diagram in the RFC shows this). A flag would not suffice; we need a tunable threshold.

  1. If there's a strong case to be made for having a tunable threshold, that can be added as KV transfer config
  2. Finally, if there's a strong case for per-request tuning of this threshold, that can build upon all of the above

We believe the tunable threshold per request gives the highest flexibility. Without it the system is fixed and the threshold has to be determined at loading time. Thinking about it some more, if we want to reduce complexity, we can decide to avoid a global-threshold altogether and use only per-request thresholds.

An example usage for this flow can be seen in the parallel work now being done for adding Decode-first support to llm-d, see feat(lmcache): implement decode first flow on lmcache connector when cache_hit_threshold field is present

If you want, we would love to set up some time to review the suggestion as a whole.

@mergify
Contributor

mergify bot commented Jan 20, 2026

Hi @kfirwolfson, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 3d5fba0 to f5f75ef Compare January 20, 2026 08:34
@mergify
Contributor

mergify bot commented Jan 20, 2026

Hi @kfirwolfson, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@mergify
Contributor

mergify bot commented Jan 28, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kfirwolfson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 28, 2026
Kfir Wolfson added 2 commits January 28, 2026 11:29
Fix Gemini CR comments
Add unit tests
Move from SamplingParams to request
unit test remake
fix static code analysis rejects
Fix unit test
fix after local CR
fix pre-commit reject
add threshold to request logger and fix some calls to encode
fix ruff

Signed-off-by: Kfir Wolfson <kfirw@pliops.com>
…oject#32726 review

Signed-off-by: Kfir Wolfson <kfirw@pliops.com>
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 641a7c8 to d11d1fa Compare January 28, 2026 09:31
@mergify mergify bot removed the needs-rebase label Jan 28, 2026